 Compactness = (Avg. Perimeter)^2 / Area
 Circularity = (Avg. Radius)^2 / Area
 Distance Circularity = Area / (Avg. Distance from Border)^2
 Radius Ratio = (Max. Radius - Min. Radius) / Avg. Radius
 Pr. Axis Aspect Ratio = Minor Axis / Major Axis
 Max. Length Aspect Ratio = (Length Perp. to Max. Length) / (Max. Length)
 Scatter Ratio = (Inertia about Minor Axis) / (Inertia about Major Axis)
 Elongatedness = Area / (Shrink Width)^2
 Pr. Axis Rectangularity = Area / (Pr. Axis Length * Pr. Axis Width)
 Max. Length Rectangularity = Area / (Max. Length * Length Perp. to this)
 Scaled Variance (scaled variance along major axis) = (2nd order moment about major axis) / Area
 Scaled Variance.1 (scaled variance along minor axis) = (2nd order moment about minor axis) / Area
 Scaled Radius of Gyration = (Major Variance + Minor Variance) / Area
 Scaled Radius of Gyration.1 (skewness about major axis) = (3rd order moment about major axis) / (Sigma Minor)^3
 Skewness About (skewness about minor axis) = (3rd order moment about minor axis) / (Sigma Major)^3
 Skewness About.1 (kurtosis about major axis) = (4th order moment about major axis) / (Sigma Minor)^4
 Skewness About.2 (kurtosis about minor axis) = (4th order moment about minor axis) / (Sigma Major)^4
 Hollows Ratio = (Area of Hollows) / (Area of Bounding Polygon)
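To make the first of these definitions concrete, here is a minimal sketch (not part of the original pipeline) that computes area, perimeter and compactness for a synthetic 10x10 square silhouette; the helper `boundary_pixel_count` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def boundary_pixel_count(mask):
    # a pixel lies on the perimeter if any of its 4 neighbours falls outside the mask
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return int((mask & ~interior).sum())

mask = np.zeros((12, 12), dtype=bool)
mask[1:11, 1:11] = True                    # a solid 10x10 square silhouette
area = int(mask.sum())                     # 100 pixels
perimeter = boundary_pixel_count(mask)     # 36 boundary pixels
compactness = perimeter ** 2 / area        # (perimeter)^2 / area = 12.96
```

The real dataset features were, of course, extracted with the original HIPS routines; this only shows the flavour of the geometry involved.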
The main purpose is to classify vehicles into their respective labels using the numerical attributes of the geometric features extracted from the silhouettes, applying a dimensionality reduction technique (PCA) and training a model on the principal components instead of just the raw data.
# importing the necessary packages for managing data
import pandas as pd
import numpy as np
# importing the necessary packages for visualisation
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes = True) # it will add a nice background to the graphs
%matplotlib inline
# command to tell the notebook to display the graphs inline
sns.set_style(style = 'darkgrid')
# methods for splitting the data and the SVM classifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# methods and classes for evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn import metrics
from sklearn.model_selection import cross_validate
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('vehicle (1).csv')
df.head(20).T
df.tail(20).T
def indetailtable(df):
    print(f'Dataset Shape: {df.shape}')
    print('Total number of rows in dataset = {}'.format(df.shape[0]))
    print('Total number of columns in dataset = {}'.format(df.shape[1]))
    print('Various datatypes present in the dataset are: {}\n'.format(df.dtypes.value_counts()))
    summary = pd.DataFrame(df.dtypes, columns = ['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing_values'] = df.isnull().sum().values
    summary['Unique_values'] = df.nunique().values
    summary['Duplicate_values'] = df.duplicated().sum()
    summary['1st value'] = df.loc[0].values
    summary['2nd value'] = df.loc[1].values
    summary['844th value'] = df.loc[843].values
    summary['845th value'] = df.loc[844].values
    return summary
brief = indetailtable(df)
brief
There are 845 data points / observations and 19 columns / features in the dataset.
Of these, 14 columns are of float64 data type, four are integer (int64), and one is object. The 'class' column is of object datatype and contains three unique labels: van, bus and car.
There are quite a few missing values in the dataset, so we need to handle them appropriately to avoid their adverse effect on model performance.
There are no duplicate rows in the dataset, so it can be treated as clean in terms of duplicates.
Except for the 'class' column, which is our target, every other column contains many unique values.
# We will encode the Categorical Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
df.iloc[:,18] = label_encoder.fit_transform(df.iloc[:,18]).astype('float64')
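As a quick sanity check on this step: LabelEncoder assigns codes in alphabetical order of the labels, so with the three classes here bus maps to 0, car to 1 and van to 2. A small illustrative sketch (toy strings, not the project column):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['van', 'bus', 'car', 'car'])
# classes_ is sorted alphabetically: ['bus', 'car', 'van'] -> codes 0, 1, 2
print(list(le.classes_), list(codes))
```

Keeping this mapping in mind matters later, when we name the rows and columns of the confusion matrices.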
df.info()
df.columns
df=df.rename(columns = {'pr.axis_aspect_ratio':'pr_axis_aspect_ratio', 'max.length_aspect_ratio':'max_length_aspect_ratio',
'pr.axis_rectangularity':'pr_axis_rectangularity', 'max.length_rectangularity':'max_length_rectangularity',
'scaled_variance.1':'scaled_variance_1', 'scaled_radius_of_gyration.1':'scaled_radius_of_gyration_1',
'skewness_about.1':'skewness_about_1', 'skewness_about.2':'skewness_about_2'})
df.columns
for value in ['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
              'pr_axis_aspect_ratio', 'max_length_aspect_ratio', 'scatter_ratio',
              'elongatedness', 'pr_axis_rectangularity', 'max_length_rectangularity',
              'scaled_variance', 'scaled_variance_1', 'scaled_radius_of_gyration',
              'scaled_radius_of_gyration_1', 'skewness_about', 'skewness_about_1',
              'skewness_about_2', 'hollows_ratio', 'class']:
    print(value, ':', sum(df[value] == '?'))
# Any of the values in the dataframe is a missing Value?
df.isnull().values.any()
# total number of missing values present in the entire dataframe..
df.isnull().sum().sum()
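Before imputing, it helps to recall exactly what replacing NaN with the column mean does; a minimal toy example (illustrative data, not the project columns):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])
filled = s.fillna(s.mean())   # mean of the non-missing values (1, 3, 5) is 3.0
print(filled.tolist())
```

Each missing entry receives the mean of the observed values, which preserves the column mean but shrinks its variance slightly.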
# Replacing NaN in each affected column with that column's mean
cols_with_na = ['circularity', 'distance_circularity', 'radius_ratio', 'pr_axis_aspect_ratio',
                'scatter_ratio', 'pr_axis_rectangularity', 'scaled_variance', 'scaled_variance_1',
                'scaled_radius_of_gyration', 'scaled_radius_of_gyration_1', 'skewness_about',
                'skewness_about_1', 'skewness_about_2', 'elongatedness']
for col in cols_with_na:
    df[col].fillna(df[col].mean(), inplace = True)
# Any of the values in the dataframe is a missing Value?
df.isnull().values.any()
# total number of missing values present in the entire dataframe..
df.isnull().sum()
df.describe().T
The table above gives the five-point summary for each feature (minimum, 25th, 50th and 75th percentiles, and maximum), along with the count, mean and standard deviation.
• From this statistical distribution it can be seen that the features differ markedly in their ranges and distributions.
• The columns have different units and scales: some measure dimensions directly while others are ratios, so each has its own range of values and its own distribution.
• The target column is trinomial, with three classes encoded as 0, 1 and 2 for bus, car and van respectively.
Checking the class counts individually:
# Checking the value counts for the vehicle classes
pd.value_counts(df['class'])
plt.figure(figsize = (20,15))
plt.subplot(2,2,1)
pd.value_counts(df['class']).plot(kind = 'bar', color = 'purple'); # to plot a bar chart
plt.xlabel('Types of Vehicles');
plt.subplot(2,2,2)
# Data to plot
labels = 'Car', 'Bus', 'Van'
sizes = [429, 218, 199]
colors = ['yellowgreen', 'violet', 'orange']
explode = (0.1, 0, 0.1) # explode 1st slice
# Plot
plt.pie(sizes,explode=explode, labels=labels, colors=colors, # to plot a pie chart
autopct='%1.1f%%',shadow=True, startangle= - 135)
plt.xlabel('Types of Vehicles');
plt.axis('equal')
plt.show()
df.groupby(['class']).count()
Out of all the vehicles, 50.7% are cars, 23.5% are vans and 25.8% are buses.
The dataset is skewed in terms of the target column, due to the uneven distribution of vehicle classes: there appear to be more cars on the road than buses and vans together.
The main goal is to correctly identify cars, buses and vans from their geometric configuration.
We do not want to misclassify a car as a bus or van, or vice versa, which could bias the model in production; thus our main aim is to minimize both Type I and Type II errors.
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
plt.hist(df.compactness, color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Compactness of Vehicle')
plt.subplot(3,3,2)
plt.hist(df.circularity, color = 'red', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Measure of circularity of vehicle')
plt.subplot(3,3,3)
plt.hist(df.elongatedness, color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Elongatedness of the Vehicle')
plt.subplot(3,3,4)
plt.hist(df.scaled_variance, color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Variance of the vehicle after scaling along Major axis')
plt.subplot(3,3,5)
plt.hist(df.scaled_radius_of_gyration, color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Radius of Gyration of vehicle')
plt.subplot(3,3,6)
plt.hist(df.pr_axis_rectangularity, color = 'violet', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Measure of Rectangularity of the vehicle') ;
plt.subplot(3,3,7)
plt.hist(df.pr_axis_aspect_ratio, color = 'teal', edgecolor = 'purple', alpha = 0.7);
plt.xlabel('Aspect Ratio between Major Axis and Minor Axis') ;
plt.subplot(3,3,8)
plt.hist(df.scatter_ratio, color = 'brown', edgecolor = 'yellow', alpha = 0.7);
plt.xlabel('Scatter ratio: Ratio between inertia about Major & Minor Axis') ;
plt.subplot(3,3,9)
plt.hist(df.radius_ratio, color = 'white', edgecolor = 'red', alpha = 0.7);
plt.xlabel('Radius Ratio between Maximum and Minimum Radius');
Comments:
From the histogram plots above it can be observed that none of the independent attributes is normally distributed.
The aspect ratio column looks roughly bell-shaped but has a pronounced right tail. Some attributes are multimodal, which may be due to a mixture of Gaussians arising at different points of data collection.
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.violinplot(df['compactness'], color = 'green')
plt.xlabel('Compactness of Vehicle')
plt.subplot(3,3,2)
sns.violinplot(df['circularity'], color = 'red')
plt.xlabel('Measure of circularity of vehicle')
plt.subplot(3,3,3)
sns.violinplot(df['elongatedness'], palette = 'gist_rainbow')
plt.xlabel('Elongatedness of the Vehicle')
plt.subplot(3,3,4)
sns.violinplot(df['scaled_variance'], color = 'orange')
plt.xlabel('Variance of the vehicle after scaling along Major axis')
plt.subplot(3,3,5)
sns.violinplot(df['scaled_radius_of_gyration'], color = 'purple')
plt.xlabel('Radius of Gyration of vehicle')
plt.subplot(3,3,6)
sns.violinplot(df['pr_axis_rectangularity'], color = 'violet')
plt.xlabel('Measure of Rectangularity of the vehicle') ;
plt.subplot(3,3,7)
sns.violinplot(df['pr_axis_aspect_ratio'], color = 'brown')
plt.xlabel('Aspect Ratio between Major Axis and Minor Axis') ;
plt.subplot(3,3,8)
sns.violinplot(df['scatter_ratio'], color = 'teal')
plt.xlabel('Scatter ratio: Ratio between inertia about Major & Minor Axis') ;
plt.subplot(3,3,9)
sns.violinplot(df['radius_ratio'], color = 'white')
plt.xlabel('Radius Ratio between Maximum and Minimum Radius');
From the violin plots above it can be inferred that the columns 'aspect ratio', 'scatter ratio' and 'radius ratio' are highly right skewed; 'scaled variance' also appears right skewed.
A clear bimodal distribution can also be observed for columns such as 'scaled variance along major axis', 'scatter ratio', 'rectangularity' and 'elongatedness'.
However, the degree of modality is not uniform across these columns. The aspect ratio attribute looks approximately normal but is, again, right skewed.
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
plt.hist(df.distance_circularity, color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Distance Circularity')
plt.subplot(3,3,2)
plt.hist(df.hollows_ratio, color = 'red', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Hollowness Ratio')
plt.subplot(3,3,3)
plt.hist(df.max_length_aspect_ratio, color = 'blue', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Maximum Length Aspect Ratio')
plt.subplot(3,3,4)
plt.hist(df.max_length_rectangularity, color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Maximum Length Rectangularity')
plt.subplot(3,3,5)
plt.hist(df.scaled_variance_1, color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Scaled Variance 1: Along Minor Axis')
plt.subplot(3,3,6)
plt.hist(df.scaled_radius_of_gyration_1, color = 'violet', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Scaled Radius of Gyration-1') ;
plt.subplot(3,3,7)
plt.hist(df.skewness_about, color = 'teal', edgecolor = 'brown', alpha = 0.7);
plt.xlabel('Skewness About') ;
plt.subplot(3,3,8)
plt.hist(df.skewness_about_1, color = 'brown', edgecolor = 'yellow', alpha = 0.7);
plt.xlabel('Skewness About 1') ;
plt.subplot(3,3,9)
plt.hist(df.skewness_about_2, color = 'white', edgecolor = 'red', alpha = 0.7);
plt.xlabel('Skewness About 2') ;
From the histogram plots above, the following inferences can be drawn: none of the columns is normally distributed. The 'maximum length rectangularity' column looks roughly normal but is slightly right skewed.
Columns like 'distance circularity', 'hollows ratio', 'maximum length rectangularity' and 'skewness about-2' have data points spread across their whole range and can be expected to influence the target column.
Columns such as 'maximum length aspect ratio' and 'scaled radius of gyration-1' are concentrated in a narrow range and are right skewed. Discontinuities in their distributions suggest the presence of outliers; such columns can bias the target attribute and hamper model performance in production.
# Plotting the distribution of continous features individually
plt.figure(figsize = (20,15))
plt.subplot(3,3,1)
sns.violinplot(df['distance_circularity'], color = 'green')
plt.xlabel('Distance Circularity')
plt.subplot(3,3,2)
sns.violinplot(df['hollows_ratio'], color = 'red')
plt.xlabel('Hollowness Ratio')
plt.subplot(3,3,3)
sns.violinplot(df['max_length_aspect_ratio'], palette = 'gist_rainbow')
plt.xlabel('Maximum Length Aspect Ratio')
plt.subplot(3,3,4)
sns.violinplot(df['max_length_rectangularity'], color = 'orange')
plt.xlabel('Maximum Length Rectangularity')
plt.subplot(3,3,5)
sns.violinplot(df['scaled_variance_1'], color = 'purple')
plt.xlabel('Scaled Variance 1: Along Minor Axis')
plt.subplot(3,3,6)
sns.violinplot(df['scaled_radius_of_gyration_1'], color = 'violet')
plt.xlabel('Scaled Radius of Gyration-1') ;
plt.subplot(3,3,7)
sns.violinplot(df['skewness_about'], color = 'brown')
plt.xlabel('Skewness About') ;
plt.subplot(3,3,8)
sns.violinplot(df['skewness_about_1'], color = 'teal')
plt.xlabel('Skewness About 1') ;
plt.subplot(3,3,9)
sns.violinplot(df['skewness_about_2'], color = 'white')
plt.xlabel('Skewness About 2');
# for doing statistical calculation
import scipy
from sklearn import linear_model
import statsmodels.api as sm
from sklearn import metrics
from sklearn import datasets
import scipy.stats as stats
from scipy.stats import skew
# Preparing a pandas dataframe to store the skewness of each column.
feature_cols = ['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
                'pr_axis_aspect_ratio', 'max_length_aspect_ratio', 'scatter_ratio',
                'elongatedness', 'pr_axis_rectangularity', 'max_length_rectangularity',
                'scaled_variance', 'scaled_variance_1', 'scaled_radius_of_gyration',
                'scaled_radius_of_gyration_1', 'skewness_about', 'skewness_about_1',
                'skewness_about_2', 'hollows_ratio']
Skewness = pd.DataFrame({'Skewness': [stats.skew(df[col]) for col in feature_cols]},
                        index = feature_cols)
Skewness
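As a reminder of how to read this table: stats.skew is exactly zero for a symmetric sample and positive when the right tail is longer. A tiny sketch on toy values:

```python
from scipy.stats import skew

symmetric = [1, 2, 3, 4, 5]          # symmetric around its mean
right_tailed = [1, 1, 1, 1, 10]      # one long right tail
print(skew(symmetric), skew(right_tailed))
```

A common rule of thumb treats |skewness| above 1 as highly skewed, which is the lens used for the columns discussed above.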
plt.figure(figsize = (20,18))
plt.subplot(5,2,1)
sns.boxplot(x = df.compactness, color = 'green')
plt.xlabel('Compactness of Vehicle')
plt.subplot(5,2,2)
sns.boxplot(x = df.circularity, color = 'red')
plt.xlabel('Measure of circularity of vehicle')
plt.subplot(5,2,3)
sns.boxplot(x = df.elongatedness,palette = 'gist_rainbow')
plt.xlabel('Elongatedness of the Vehicle')
plt.subplot(5,2,4)
sns.boxplot(x = df.scaled_variance, color = 'orange')
plt.xlabel('Variance of the vehicle after scaling along Major Axis')
plt.subplot(5,2,5)
sns.boxplot(x = df.scaled_radius_of_gyration, color = 'purple')
plt.xlabel('Radius of Gyration of vehicle')
plt.subplot(5,2,6)
sns.boxplot(x = df.pr_axis_rectangularity, color = 'violet')
plt.xlabel('Measure of Rectangularity of the vehicle') ;
plt.subplot(5,2,7)
sns.boxplot(x = df.pr_axis_aspect_ratio, color = 'brown')
plt.xlabel('Aspect Ratio between Major Axis and Minor Axis') ;
plt.subplot(5,2,8)
sns.boxplot(x = df.scatter_ratio, color = 'teal')
plt.xlabel('Scatter ratio: Ratio between inertia about Major & Minor Axis') ;
plt.subplot(5,2,9)
sns.boxplot(x = df.radius_ratio, color = 'white')
plt.xlabel('Radius Ratio between Maximum and Minimum Radius');
plt.figure(figsize = (20,18))
plt.subplot(5,2,1)
sns.boxplot(x = df.distance_circularity, color = 'green')
plt.xlabel('Distance Circularity')
plt.subplot(5,2,2)
sns.boxplot(x = df.hollows_ratio, color = 'red')
plt.xlabel('Hollowness Ratio')
plt.subplot(5,2,3)
sns.boxplot(x = df.max_length_aspect_ratio, palette = 'gist_rainbow')
plt.xlabel('Maximum Length Aspect Ratio')
plt.subplot(5,2,4)
sns.boxplot(x = df.max_length_rectangularity, color = 'orange')
plt.xlabel('Maximum Length Rectangularity')
plt.subplot(5,2,5)
sns.boxplot(x = df.scaled_variance_1, color = 'purple')
plt.xlabel('Scaled Variance 1: Along Major Axis')
plt.subplot(5,2,6)
sns.boxplot(x = df.scaled_radius_of_gyration_1, color = 'violet')
plt.xlabel('Scaled Radius of Gyration-1') ;
plt.subplot(5,2,7)
sns.boxplot(x = df.skewness_about, color = 'brown')
plt.xlabel('Skewness About') ;
plt.subplot(5,2,8)
sns.boxplot(x = df.skewness_about_1, color = 'teal')
plt.xlabel('Skewness About 1') ;
plt.subplot(5,2,9)
sns.boxplot(x = df.skewness_about_2, color = 'white')
plt.xlabel('Skewness About 2');
sns.pairplot(df, hue="class");
df.corr().T
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (15,12))
plt.title('Pearson Correlation of attributes', y=1, size = 20)
sns.heatmap(df.corr(), linewidth = 0.2, vmax = 1.0,
square = True, cmap = colormap,linecolor = 'red', annot = True);
plt.figure(figsize = (20,24))
plt.subplot(4,3,1)
sns.scatterplot(df.compactness, df.circularity, hue = df['class'], palette = ['red', 'blue', 'orange'])
plt.title('compactness Vs. circularity', color = 'brown');
plt.subplot(4,3,2)
sns.scatterplot(df.radius_ratio, df.scatter_ratio, hue = df['class'], palette = ['purple', 'orange', 'green'])
plt.title('radius_ratio Vs. scatter_ratio', color = 'brown');
plt.subplot(4,3,3)
sns.scatterplot(df.elongatedness, df.compactness, hue = df['class'], palette = ['red', 'green', 'yellow']);
plt.title('elongatedness Vs. compactness', color = 'brown');
plt.subplot(4,3,4)
sns.scatterplot(df.elongatedness, df.circularity, hue = df['class'], palette = ['orange', 'green', 'violet']);
plt.title('elongatedness Vs. circularity', color = 'brown');
plt.subplot(4,3,5)
sns.scatterplot(df.scaled_radius_of_gyration_1, df.skewness_about_2, hue = df['class'], palette = ['blue', 'orange', 'brown']);
plt.title('scaled_radius_of_gyration_1 Vs. skewness_about_2 ', color = 'brown');
plt.subplot(4,3,6)
sns.scatterplot(df.scaled_radius_of_gyration_1, df.hollows_ratio, hue = df['class'], palette='nipy_spectral');
plt.title('scaled_radius_of_gyration_1 Vs.hollows_ratio', color = 'brown');
plt.subplot(4,3,7)
sns.scatterplot(df.skewness_about_2, df.hollows_ratio, hue = df['class'], palette='plasma_r');
plt.title('skewness_about_2 Vs. hollows_ratio', color = 'brown');
plt.subplot(4,3,8)
sns.scatterplot(df.pr_axis_rectangularity, df.max_length_rectangularity, hue = df['class'], palette = ['purple', 'green', 'orange']);
plt.title('pr_axis_rectangularity Vs. max_length_rectangularity', color = 'brown');
plt.subplot(4,3,9)
sns.scatterplot(df.pr_axis_aspect_ratio, df.scatter_ratio, hue = df['class'], palette = ['Blue', 'purple', 'green']);
plt.title('pr_axis_aspect_ratio Vs. scatter_ratio', color = 'brown');
plt.subplot(4,3,10)
sns.scatterplot(df.scaled_variance, df.scaled_variance_1, hue = df['class'], palette = ['violet', 'purple', 'green']);
plt.title('Scaled Variance along Major axis vs. Minor axis', color = 'brown');
plt.subplot(4,3,11)
sns.scatterplot(df.scaled_radius_of_gyration_1, df.skewness_about, hue = df['class'], palette = ['orange','purple', 'red']);
plt.title('scaled_radius_of_gyration_1 Vs. skewness_about', color = 'brown');
plt.subplot(4,3,12)
sns.scatterplot(df.skewness_about_1, df.skewness_about_2, hue = df['class'], palette = ['Blue', 'orange', 'green']);
plt.title('skewness_about_1 Vs. skewness_about_2', color = 'brown');
df_ind_var = df.drop('class', axis =1) # Separating the target column from independent column
from scipy.stats import zscore
df_ind_var_z = df_ind_var.apply(zscore) # applying zscore to the numerical columns.
df_ind_var_z.head()
from sklearn.model_selection import train_test_split
X = df.drop('class', axis =1)
y = df['class']
X_z = X.apply(zscore)
X_z.head()
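The z-score transform standardizes each column to zero mean and unit standard deviation (population std, ddof = 0, which is scipy's default). A small check on toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

demo = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})
demo_z = demo.apply(zscore)
print(demo_z.mean().round(12).tolist())   # ~0 for every column
```

Standardizing like this matters for SVMs and PCA, both of which are sensitive to the relative scales of the features.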
x_train, x_test, y_train, y_test = train_test_split(X_z, y, test_size = 0.3, random_state = 533)
x_train.head()
y_train.head()
# checking the split of data
print('{0:0.2f}% data is in training set'.format((len(x_train)/len(df.index))*100))
print('{0:0.2f}% data is in testing set'.format((len(x_test)/len(df.index))*100))
# Building a Support Vector Machine on train data
svc_model_rbf = SVC( kernel = 'rbf')
svc_fit_rbf = svc_model_rbf.fit(x_train, y_train)
y_train_pred_svc_rbf = svc_fit_rbf.predict(x_train)
print('Training model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_train, y_train_pred_svc_rbf)*100))
y_test_pred_svc_rbf = svc_fit_rbf.predict(x_test)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_test, y_test_pred_svc_rbf)*100))
cmSVC_rbf = metrics.confusion_matrix(y_test, y_test_pred_svc_rbf, labels = [0.0, 1.0, 2.0])
# LabelEncoder assigned codes alphabetically: bus = 0, car = 1, van = 2
df_cmSVC_rbf = pd.DataFrame(cmSVC_rbf, index = ['Actual bus', 'Actual car', 'Actual van'],
                            columns = ['Predict bus', 'Predict car', 'Predict van'])
colormap = plt.cm.viridis
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC_rbf, annot = True, fmt = 'g', cmap = colormap, linecolor = 'red');
The confusion matrix Explanation:
print(metrics.classification_report (y_test, y_test_pred_svc_rbf, labels = [1.0,2.0,0.0]))
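To make the classification report easier to read, here is a tiny worked confusion matrix (hypothetical labels, not this project's predictions) showing how per-class recall and the macro average are computed:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])  # rows = actual, columns = predicted
print(cm)
# per-class recall: 1/2, 2/2, 1/2 -> macro recall = (1/2 + 1 + 1/2) / 3 = 2/3
macro_recall = recall_score(y_true, y_pred, average='macro')
```

The diagonal of the matrix holds the correct predictions; each off-diagonal cell is a specific kind of misclassification.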
import numpy as np
y_grid = (np.column_stack([y_test, y_test_pred_svc_rbf])) # Checking the prediction value and actual value.
print(y_grid)
np.savetxt('ocr.txt', y_grid, fmt = '%s') # Saving the prediction value and actual value in a text file format.
resultsdf = pd.DataFrame({'Techniques Used': ['SVM with scaling'],
                          'Accuracy (%)': [accuracy_score(y_test, y_test_pred_svc_rbf)*100],
                          'Std. Dev (%)': [0.0]})  # a single train/test split has no spread to report
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 25
seed = 235
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)  # shuffle is required for random_state to take effect
svc_k_model = SVC ( kernel = 'rbf')
results = cross_val_score(svc_k_model, X, y, cv = kfold)
print(results)
print('\n')
print('Accuracy before applying z_score:%.3f%% (%.3f%%)'%(results.mean()*100.0, results.std()*100.0))
tempresultsdf = pd.DataFrame({'Techniques Used': ['SVM Kfold w/o scaling'], 'Accuracy (%)': [results.mean()*100], 'Std. Dev (%)': [results.std()*100.0] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
num_folds = 25
seed = 235
kfold = KFold(n_splits = num_folds, shuffle = True, random_state = seed)  # shuffle is required for random_state to take effect
svc_k_model = SVC ( kernel = 'rbf')
results = cross_val_score(svc_k_model, X_z, y, cv = kfold)
print(results)
print('\n')
print('Accuracy after applying z_score:%.3f%% (%.3f%%)'%(results.mean()*100.0, results.std()*100.0))
tempresultsdf = pd.DataFrame({'Techniques Used': ['SVM Kfold with scaling'], 'Accuracy (%)': [results.mean()*100], 'Std. Dev (%)': [results.std()*100.0] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
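As a sketch of what cross_val_score returns (synthetic data, not the vehicle set): one accuracy per fold, whose mean and standard deviation are what populate the results table above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_demo = rng.randn(120, 4)
y_demo = (X_demo[:, 0] > 0).astype(int)   # an easily separable toy target
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel='rbf'), X_demo, y_demo, cv=kfold_demo)
print(scores.mean(), scores.std())
```

A low standard deviation across folds indicates the score is stable and not an artifact of one lucky split.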
corr = df.corr()
corr
columns = np.full((corr.shape[0],), True, dtype = bool)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.90:
            if columns[j]:
                columns[j] = False
selected_columns = df.columns[columns]
selected_columns.shape
selected_columns
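The filtering loop above can be verified on a toy frame: with a threshold of 0.90, the second member of each highly correlated pair is dropped (illustrative data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [2.0, 4.0, 6.0, 8.0],     # perfectly correlated with 'a'
                    'c': [1.0, -1.0, 1.0, -1.0]})  # weakly (negatively) correlated with both
toy_corr = toy.corr()
keep = np.full(toy_corr.shape[0], True)
for i in range(toy_corr.shape[0]):
    for j in range(i + 1, toy_corr.shape[0]):
        if toy_corr.iloc[i, j] >= 0.90 and keep[j]:
            keep[j] = False
print(list(toy.columns[keep]))
```

Note the loop keeps the first column of each correlated pair; which member survives depends only on column order.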
df.columns
df1 = df[selected_columns]
# Assigning the selected columns after feature elimination
df_corr = df1.corr()
df_corr
# Checking the correlation among those columns
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (15,12))
plt.title('Pearson Correlation of attributes', y=1, size = 20)
sns.heatmap(df1.corr(), linewidth = 0.2, vmax = 1.0,
square = True, cmap = colormap,linecolor = 'red', annot = True);
Here we will select columns based on their p-values. We remove the 'class' column because it is the column we are trying to predict.
selected_columns1 = selected_columns[:12].values
# removing the class column
selected_columns1
import statsmodels.api as sm
def backwardElimination(x, y, sl, columns):
    numvars = len(x[0])
    for i in range(0, numvars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxvar = max(regressor_OLS.pvalues).astype(float)
        if maxvar > sl:
            for j in range(0, numvars - i):
                if regressor_OLS.pvalues[j].astype(float) == maxvar:
                    x = np.delete(x, j, 1)
                    columns = np.delete(columns, j)
    regressor_OLS.summary()
    return x, columns
SL = 0.05 # Threshold limit is 0.05
data_modeled, selected_columns2 = backwardElimination(df1.iloc[:,:12].values, df1.iloc[:,12].values, SL, selected_columns1)
# Moving the result to a new dataframe
result = pd.DataFrame()
result['class'] = df.iloc[:,18]
result.head(10).T
# Creating a dataframe with the columns selected using the p-value and correlation
df2 = pd.DataFrame(data = data_modeled, columns = selected_columns2)
df2.head()
Plotting the data to visualize their distribution
# Creating a dataframe with the columns selected using correlation alone
data_modeled1 = df1.iloc[:, :12].values
df3 = pd.DataFrame(data = data_modeled1, columns = selected_columns1)
df3.head()
fig = plt.figure(figsize = (20,25))
j = 0
for i in df3.columns:
    plt.subplot(6, 4, j + 1)
    j += 1
    sns.distplot(df3[i][result['class'] == 0], color = 'r', label = 'Bus')
    sns.distplot(df3[i][result['class'] == 1], color = 'b', label = 'Car')
    sns.distplot(df3[i][result['class'] == 2], color = 'g', label = 'Van')
    plt.legend(loc = 'best')
fig.suptitle('Vehicle Data Analysis after feature elimination using only correlation')
fig.tight_layout()
fig.subplots_adjust(top = 0.95)
plt.show()
fig = plt.figure(figsize = (20,25))
j = 0
for i in df2.columns:
    plt.subplot(6, 4, j + 1)
    j += 1
    sns.distplot(df2[i][result['class'] == 0], color = 'r', label = 'Bus')
    sns.distplot(df2[i][result['class'] == 1], color = 'b', label = 'Car')
    sns.distplot(df2[i][result['class'] == 2], color = 'g', label = 'Van')
    plt.legend(loc = 'best')
fig.suptitle('Vehicle Data Analysis after feature elimination using both p-value and correlation')
fig.tight_layout()
fig.subplots_adjust(top = 0.95)
plt.show()
Here we took a correlation threshold of 0.9 and eliminated one column from each pair whose correlation was at least 0.9; we then removed columns whose p-value exceeded the 0.05 threshold.
Some information is inevitably lost when those columns are deleted.
x_train2, x_test2, y_train2, y_test2 = train_test_split(df2.values, result.values, test_size = 0.3, random_state = 256)
We use a Support Vector Classifier with an RBF kernel to make the predictions. We will train the model on the training data and calculate the accuracy on the test data.
# Building a Support Vector Machine on train data
svc_model_rbf2 = SVC(kernel = 'rbf')
svc_fit_rbf2 = svc_model_rbf2.fit(x_train2, y_train2.ravel())  # ravel() flattens the (n,1) target into a 1-D array
y_train_pred_svc_rbf2 = svc_fit_rbf2.predict(x_train2)
print('Training model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_train2, y_train_pred_svc_rbf2)*100))
y_test_pred_svc_rbf2 = svc_fit_rbf2.predict(x_test2)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_test2, y_test_pred_svc_rbf2)*100))
from sklearn.metrics import confusion_matrix
cmSVC_rbf2 = metrics.confusion_matrix(y_test2, y_test_pred_svc_rbf2, labels = [0,1,2])
df_cmSVC_rbf2 = pd.DataFrame(cmSVC_rbf2, index = [i for i in ['Actual bus', 'Actual car', 'Actual van']],
columns = [i for i in ['Predict bus', 'Predict car', 'Predict van']])
colormap = plt.cm.viridis
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC_rbf2, annot = True, fmt = 'g', cmap = colormap, linecolor = 'red');
The confusion matrix Explanation:
print(metrics.classification_report (y_test2, y_test_pred_svc_rbf2, labels = [0,1,2]))
tempresultsdf = pd.DataFrame({'Techniques Used': ['SVM after feature elimination w/o scaling'],
                              'Accuracy (%)': [accuracy_score(y_test2, y_test_pred_svc_rbf2)*100],
                              'Std. Dev (%)': [0.0]})  # a single train/test split has no spread to report
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
df2_z = df2.apply(zscore)
df2_z.head(3)
x_train2_z, x_test2_z, y_train2_z, y_test2_z = train_test_split(df2_z.values, result.values, test_size = 0.3, random_state = 256)
# Building a Support Vector Machine on train data
svc_model_rbf2_z = SVC(kernel = 'rbf')
svc_fit_rbf2_z = svc_model_rbf2_z.fit(x_train2_z, y_train2_z.ravel())  # ravel() flattens the (n,1) target into a 1-D array
y_train_pred_svc_rbf2_z = svc_fit_rbf2_z.predict(x_train2_z)
print('Training model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_train2_z, y_train_pred_svc_rbf2_z)*100))
y_test_pred_svc_rbf2_z = svc_fit_rbf2_z.predict(x_test2_z)
print('Testing Model Accuracy value: {0:0.2f}%'.format(accuracy_score(y_test2_z, y_test_pred_svc_rbf2_z)*100))
from sklearn.metrics import confusion_matrix
cmSVC_rbf2_z = metrics.confusion_matrix(y_test2_z, y_test_pred_svc_rbf2_z, labels = [0,1,2])
df_cmSVC_rbf2_z = pd.DataFrame(cmSVC_rbf2_z, index = [i for i in ['Actual bus', 'Actual car', 'Actual van']],
columns = [i for i in ['Predict bus', 'Predict car', 'Predict van']])
colormap = plt.cm.viridis
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC_rbf2_z, annot = True, fmt = 'g', cmap = colormap, linecolor = 'red');
The confusion matrix explanation:
The model correctly predicted 66 buses, 60 vans and 117 cars.
11 vehicles in the test set were predicted incorrectly; these misclassifications combine Type I and Type II errors.
This result is obtained after feature elimination.
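The correct/incorrect counts above can be read straight off any confusion matrix: the diagonal holds the correct predictions and everything off the diagonal is an error. A small sketch with made-up counts, chosen only to mirror the totals discussed:

```python
import numpy as np

# Illustrative 3x3 confusion matrix (rows = actual, cols = predicted)
cm = np.array([[66,   2,  1],
               [ 3, 117,  2],
               [ 1,   2, 60]])

correct = np.trace(cm)        # diagonal entries: correct predictions
wrong = cm.sum() - correct    # off-diagonal: Type I + Type II errors
print(correct, wrong)         # → 243 11
```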
print(metrics.classification_report (y_test2_z, y_test_pred_svc_rbf2_z, labels = [0,1,2]))
tempresultsdf = pd.DataFrame({'Techniques Used': ['SVM after feature elimination with scaling'], 'Accuracy (%)': [accuracy_score(y_test2_z, y_test_pred_svc_rbf2_z).mean()*100], 'Std. Dev (%)': [accuracy_score(y_test2_z, y_test_pred_svc_rbf2_z).std()*100.0] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
x_train_z, x_test_z, y_train, y_test = train_test_split(X_z, y, test_size = 0.3, random_state = 533)
x_train_z.head()
cov_matrix = np.cov(X_z.T) # covariance matrix of the standardized data
print('Covariance Matrix \n%s' % cov_matrix)
e_vals, e_vecs = np.linalg.eig(cov_matrix)
print('Eigenvectors \n%s'%e_vecs)
print('\nEigenValues \n%s'%e_vals)
tot = sum(e_vals)
var_exp = [(i/ tot)* 100 for i in sorted(e_vals, reverse = True)]
cum_var_exp = np.cumsum(var_exp)
print('Cumulative Variance Explained', cum_var_exp)
plt.plot(var_exp, color = 'red');
# Scree Plot
plt.figure(figsize = (10,5))
plt.bar(range(1, e_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance', color = 'red')
plt.step(range(1, e_vals.size + 1), cum_var_exp, where = 'mid', label = 'Cumulative explained variance', color = 'green')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
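The scree plot is typically read off against a variance threshold. A small sketch of picking the smallest number of components whose cumulative explained variance reaches 95% (the eigenvalues below are illustrative, not the ones computed above):

```python
import numpy as np

# Illustrative eigenvalues in place of e_vals from the notebook
eig = np.array([8.0, 4.0, 2.0, 1.0, 0.6, 0.4])
var_exp = eig / eig.sum() * 100
cum = np.cumsum(var_exp)

# Smallest number of components reaching 95% cumulative variance
k = int(np.argmax(cum >= 95.0)) + 1
print(k, cum[k - 1])   # → 5 97.5
```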
# Importing scikit learn PCA.
# It does all the above steps and maps data to PCA dimensions in one shot
from sklearn.decomposition import PCA
# Note: we are generating only 9 principal components (dimensionality reduction from 18 to 9)
pca = PCA(n_components = 9)
X_z_PCA = pca.fit_transform(X_z)
X_z_PCA.transpose()
pca.components_
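As a sanity check, scikit-learn's PCA should agree with the manual eigendecomposition route used above: the explained-variance ratios equal the sorted covariance eigenvalues divided by their total. A small sketch on random stand-in data (`X` here is illustrative, not the vehicle data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                 # stand-in for the standardized X_z

pca = PCA(n_components = 3).fit(X)

# Manual route: eigenvalues of the covariance matrix, largest first
e_vals = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]
manual_ratio = e_vals[:3] / e_vals.sum()

# The two routes agree component by component
print(np.allclose(pca.explained_variance_ratio_, manual_ratio))  # → True
```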
sns.pairplot(pd.DataFrame(X_z_PCA));
df_comp = pd.DataFrame(pca.components_, columns = list(X_z))
df_comp.head(3)
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (15,12))
plt.title('Pearson Correlation of attributes', y=1, size = 20)
sns.heatmap(df_comp.corr(), linewidth = 0.2, vmax = 1.0,
square = True, cmap = colormap,linecolor = 'red', annot = True);
plt.figure(figsize = (12,6))
sns.heatmap(df_comp, cmap = 'viridis');
x_train_pca, x_test_pca, y_train_pca, y_test_pca = train_test_split(pd.DataFrame(X_z_PCA), y, test_size = 0.3, random_state = 533)
y_train_pca.head()
x_test_pca.head(3)
# checking the split of data
print('{0:0.2f}% data is in PCA training set'.format((len(x_train_pca)/len(df.index))*100))
print('{0:0.2f}% data is in PCA testing set'.format((len(x_test_pca)/len(df.index))*100))
# Building a Support Vector Machine on train data
svc_model_rbf_pca = SVC( kernel = 'rbf')
svc_fit_rbf_pca = svc_model_rbf_pca.fit(x_train_pca, y_train_pca)
y_train_pred_svc_rbf_pca = svc_fit_rbf_pca.predict(x_train_pca)
print('Training model Accuracy value after PCA: {0:0.2f}%'.format(accuracy_score(y_train_pca, y_train_pred_svc_rbf_pca)*100.0))
y_test_pred_svc_rbf_pca = svc_fit_rbf_pca.predict(x_test_pca)
print('Testing Model Accuracy value after PCA: {0:0.2f}%'.format(accuracy_score(y_test_pca, y_test_pred_svc_rbf_pca)*100.0))
cmSVC_rbf_pca = metrics.confusion_matrix(y_test_pca, y_test_pred_svc_rbf_pca, labels = [1.0,2.0,0.0])
df_cmSVC_rbf_pca = pd.DataFrame(cmSVC_rbf_pca, index = ['Actual car', 'Actual van', 'Actual bus'],
columns = ['Predict car', 'Predict van', 'Predict bus'])
colormap = plt.cm.viridis
plt.figure(figsize = (8,5))
sns.heatmap(df_cmSVC_rbf_pca, annot = True, fmt = 'g', cmap = colormap, linecolor = 'red');
print(metrics.classification_report (y_test_pca, y_test_pred_svc_rbf_pca, labels = [1.0,2.0,0.0]))
tempresultsdf = pd.DataFrame({'Techniques Used': ['SVM after PCA'], 'Accuracy (%)': [accuracy_score(y_test_pca, y_test_pred_svc_rbf_pca).mean()*100], 'Std. Dev (%)': [accuracy_score(y_test_pca, y_test_pred_svc_rbf_pca).std()*100.0] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
num_folds = 25
seed = 235
kfold_pca = KFold(n_splits = num_folds, shuffle = True, random_state = seed)
svc_k_model_pca = SVC(kernel = 'rbf')
results_pca = cross_val_score(svc_k_model_pca, X_z_PCA, y, cv = kfold_pca)
print(results_pca)
print('\n')
print('Accuracy after performing PCA:%.3f%% (%.3f%%)'%(results_pca.mean()*100.0, results_pca.std()*100.0))
tempresultsdf = pd.DataFrame({'Techniques Used': ['SVM after PCA with K-fold'], 'Accuracy (%)': [results_pca.mean()*100], 'Std. Dev (%)': [results_pca.std()*100.0] })
resultsdf = pd.concat([resultsdf, tempresultsdf])
resultsdf = resultsdf[['Techniques Used', 'Accuracy (%)', 'Std. Dev (%)']]
resultsdf
fig = plt.figure(figsize = (22,5))
plt.title('Accuracy values for various models/techniques', y = 1, size = 22, color = 'red')
sns.barplot(y = resultsdf['Accuracy (%)'], x = resultsdf['Techniques Used'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette('dark', 12));
plt.ylabel('Accuracy in %', size = 20)
plt.xlabel('Methodologies applied to calculate performance measurement', size = 20)
plt.tight_layout()
Here we have compared the accuracy scores of the SVC models built with the raw data against those built with the Principal Components.
The comparison also covers the SVC models evaluated with K-Fold cross-validation.
The main point to note is the difference in the number of independent attributes: the raw data has 18 independent attributes, whereas with PCA only 7 components were needed to capture about 95% of the explained variance.
We then went one step further, took one extra component, built the model with 8 principal components in total, and calculated the accuracy.
The model built with the raw data gave an accuracy of 96.85%, while the model built with the Principal Components gave 95.27%.
Similarly, we used K-Fold CV to estimate the accuracy of the models on both the raw data and the Principal Components.
We obtained an accuracy of 68.919% with the unscaled raw data, 96.57% with the scaled raw data, and 95.629% with the Principal Components.
For K-Fold CV we used 25 folds, i.e. K = 25.
However, any value of K can be chosen, depending on the number of data points, the number of independent features and convenience.
We have also reported range estimates for the K-Fold cross-validation results; this was not possible with the train-test split method, so there we reported only the point estimate of accuracy.
We have also done some feature engineering, applying OLS (Ordinary Least Squares) regression and eliminating features based on the correlations among the columns and the p-values of the independent columns.
After this feature elimination we were left with only 10 columns; building and testing the model on them gave an accuracy of around 56.29% before scaling the dataset and 95.66% after scaling it.
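The correlation-based part of that elimination can be sketched as follows; the toy columns and the 0.9 threshold below are illustrative, not the values used above:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'a': rng.randn(100)})
df['b'] = df['a'] * 0.99 + rng.randn(100) * 0.05   # nearly duplicates 'a'
df['c'] = rng.randn(100)                           # independent column

# Absolute correlations, upper triangle only (avoid double-counting pairs)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype = bool), k = 1))

# Drop any column highly correlated (> 0.9) with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)   # → ['b']
```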
Thus we can conclude that the SVC model with PCA and K-Fold has the best accuracy score among all the models, and it also gives a range estimate of the accuracy at a 95% confidence level.
This means the accuracy can vary by the standard deviation of 2.77% on both the negative and positive sides.
Since we have an accuracy of 95.629% with a corresponding standard deviation of 2.77%, we can give a range estimate of accuracy between 92.505% and 98.399% at a 95% confidence level; that is, with 95% confidence we can say the model's accuracy in production will fall between 92.505% and 98.399%.
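The range estimate comes from the mean and standard deviation of the K-fold scores; a minimal sketch with illustrative scores (not the actual fold results), using mean ± 2·std as the approximate 95% interval under a normality assumption:

```python
import numpy as np

# Stand-in for results_pca; the values are illustrative only
scores = np.array([0.97, 0.94, 0.96, 0.95, 0.96])

mean, std = scores.mean() * 100, scores.std() * 100
lo, hi = mean - 2 * std, mean + 2 * std   # ~95% range under normality
print('Accuracy: %.2f%% +/- %.2f%% -> [%.2f%%, %.2f%%]' % (mean, std, lo, hi))
```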
The reason for preferring this model is its smaller number of columns: with only 8 columns it gives an average accuracy of 95.629%, and with fewer attributes the model is also simpler and less prone to overfitting. Thus a very simple model achieves the best accuracy score. Hence the SVC model with PCA and K-Fold CV is the best model for production.